Import necessary libraries:
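The original import cell isn't shown; a plausible version, inferred from the libraries used in later sections (pandas/seaborn for EDA, scikit-learn for modeling), might look like:

```python
# Core data handling and visualization libraries (assumed from the analysis below)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling utilities used in later sections
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

import warnings
warnings.filterwarnings("ignore")  # silence convergence/deprecation noise in output cells
```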

1) Load and view the dataset

Let's check the number of unique values in each column:

Summary of the data:

Let's check the count of each unique category in each of the categorical variables:
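As a sketch (using a toy frame in place of the real dataset, with column names borrowed from the sections below), the per-category counts can be produced with `value_counts()`:

```python
import pandas as pd

# Toy frame standing in for the real dataset
df = pd.DataFrame({
    "Gender": ["M", "F", "F", "M", "F"],
    "Marital_Status": ["Married", "Single", "Married", "Unknown", "Single"],
})

# Count each unique category in every categorical (object-typed) column
for col in df.select_dtypes(include="object").columns:
    print(df[col].value_counts())
    print("-" * 30)
```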

2) Data Cleaning

Combining 'College' and 'Graduate' categories:
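A minimal sketch of this merge, assuming the categories live in an `Education_Level` column as the univariate sections below suggest:

```python
import pandas as pd

# Stand-in for the Education_Level column of the real dataset
df = pd.DataFrame({"Education_Level": ["College", "Graduate", "High School", "Graduate"]})

# Merge 'College' into 'Graduate' so the two similar categories become one
df["Education_Level"] = df["Education_Level"].replace({"College": "Graduate"})

print(df["Education_Level"].value_counts())
```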

3) Exploratory Data Analysis (EDA)

3.1) Univariate Analysis

Observations on Customer_Age:

Observations on Months_on_book:

Observations on Credit_Limit:

Observations on Total_Revolving_Bal:

Observations on Avg_Open_To_Buy:

Observations on Total_Amt_Chng_Q4_Q1:

Observations on Total_Ct_Chng_Q4_Q1:

Observations on Total_Trans_Amt:

Observations on Total_Trans_Ct:

Observations on Avg_Utilization_Ratio:

Observations on Attrition_Flag:

Observations on Gender:

Observations on Education_Level:

Observations on Marital_Status:

Observations on Income_Category:

Observations on Card_Category:

Observations on Dependent_count:

Observations on Total_Relationship_Count:

Observations on Months_Inactive_12_mon:

Observations on Contacts_Count_12_mon:

3.2) Bivariate Analysis

Attrition_Flag vs Numerical Features:

Attrition_Flag vs Gender:

Attrition_Flag vs Education_Level:

Attrition_Flag vs Marital_Status:

Attrition_Flag vs Income_Category:

Attrition_Flag vs Card_Category:

3.3) Other Exploratory Deep Dive

4) Data Pre-processing

Dropping columns with strong correlation, as discussed earlier:

Ranking ordered variables:
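One way to rank an ordered variable is an explicit mapping. The `Income_Category` bands below are illustrative; the exact labels and ordering are assumptions about the dataset:

```python
import pandas as pd

df = pd.DataFrame({"Income_Category": [
    "Less than $40K", "$40K - $60K", "$80K - $120K", "$120K +", "$60K - $80K"
]})

# Hypothetical ordinal ranking from lowest to highest income band
income_rank = {
    "Less than $40K": 0,
    "$40K - $60K": 1,
    "$60K - $80K": 2,
    "$80K - $120K": 3,
    "$120K +": 4,
}
df["Income_Category"] = df["Income_Category"].map(income_rank)
print(df["Income_Category"].tolist())
```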

Treating Outliers:
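A common treatment, sketched here on a synthetic `Credit_Limit` column, is to cap values at the 1.5×IQR whiskers; whether to cap, drop, or keep outliers is a modeling choice, so this is one option rather than the definitive method:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in with two deliberately extreme values
df = pd.DataFrame({"Credit_Limit": np.append(rng.normal(9000, 2000, 100),
                                             [40000, 50000])})

# Cap values outside the 1.5 * IQR whiskers
q1, q3 = df["Credit_Limit"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["Credit_Limit"] = df["Credit_Limit"].clip(lower, upper)

print(df["Credit_Limit"].max())
```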

5) Data Preparation

Imputing unknown values with KNN imputer:
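A sketch of this step with scikit-learn's `KNNImputer`, assuming the 'Unknown' categories have already been encoded to `NaN`; the column names and neighbour count here are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame: an ordinal-encoded column with missing ('Unknown') values
df = pd.DataFrame({
    "Education_rank": [0, 1, np.nan, 2, 1, np.nan, 0],
    "Customer_Age":   [45, 38, 50, 62, 41, 39, 47],
})

# Fill each NaN from the 5 nearest neighbours (neighbour count is a tunable choice)
imputer = KNNImputer(n_neighbors=5)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Round the imputed ordinal column back to valid category codes
imputed["Education_rank"] = imputed["Education_rank"].round()
print(imputed)
```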

Split the data into train and test sets:
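A stratified split preserves the churn rate in both sets, which matters for an imbalanced target; the 70/30 ratio and random seed below are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 84 + [1] * 16)  # churn as a minority class, roughly as in this dataset

# stratify=y keeps the ~16% churn rate in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```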

Checking inverse mapped values/categories:

6) Model building

Model evaluation criterion

The model can make two kinds of wrong predictions:

  1. Predicting a customer will churn when the customer doesn't churn (a false positive)
  2. Predicting a customer will not churn when the customer does churn (a false negative)

Which case is more important?

How can we reduce this loss, i.e., how do we reduce False Negatives?

6.1) Logistic Regression

First, let's create two functions, one to calculate different metrics and one to build the confusion matrix, so that we don't have to repeat the same code for each model.
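A sketch of what such helpers might look like; the function names (`get_metrics_score`, `make_confusion_matrix`) and metric choices are assumptions, and the notebook's versions may differ:

```python
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, confusion_matrix)

def get_metrics_score(model, X_train, X_test, y_train, y_test):
    """Return train/test accuracy, recall, precision, and F1 for a fitted model."""
    scores = []
    for X, y in ((X_train, y_train), (X_test, y_test)):
        pred = model.predict(X)
        scores += [accuracy_score(y, pred), recall_score(y, pred),
                   precision_score(y, pred), f1_score(y, pred)]
    return scores

def make_confusion_matrix(model, X_test, y_test):
    """Return the test-set confusion matrix (plotting left to seaborn/matplotlib)."""
    return confusion_matrix(y_test, model.predict(X_test))

# Quick usage on synthetic data with a churn-like class imbalance
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, weights=[0.84], random_state=0)
Xtr, Xte, ytr, yte = X[:240], X[240:], y[:240], y[240:]
model = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print(get_metrics_score(model, Xtr, Xte, ytr, yte))
```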

Let's evaluate the model performance by using KFold and cross_val_score
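A minimal sketch of this evaluation on synthetic data, scoring on recall since false negatives are the costlier error here (the scoring choice and fold count are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, weights=[0.84], random_state=0)

# 5-fold cross-validation, scored on recall
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="recall", cv=kfold)
print(scores.mean(), scores.std())
```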

6.1.1) Oversampling train data using SMOTE

Logistic Regression on oversampled data

Let's evaluate the model performance by using KFold and cross_val_score

6.1.2) Undersampling train data using RandomUnderSampler

Logistic Regression on undersampled data

Let's evaluate the model performance by using KFold and cross_val_score

6.2) Bagging and Boosting

Building different models using KFold and cross_val_score with pipelines, then tuning the best model using GridSearchCV and RandomizedSearchCV:
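A sketch of the model-comparison loop, using a few common bagging and boosting estimators inside scaler pipelines; the exact model list is an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, weights=[0.84], random_state=0)

models = {
    "Bagging": BaggingClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "GBM": GradientBoostingClassifier(random_state=1),
}

# Score each pipeline on recall with the same 5-fold split for a fair comparison
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for name, model in models.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", model)])
    scores = cross_val_score(pipe, X, y, scoring="recall", cv=kfold)
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```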

7) Hyperparameter Tuning

We will use pipelines with StandardScaler and AdaBoost model and tune the model using GridSearchCV and RandomizedSearchCV. We will also compare the performance and time taken by these two methods - grid search and randomized search.
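A compact sketch of this comparison on synthetic data; the parameter grid is hypothetical, and the `ada__` prefix is how parameters of a named pipeline step are addressed:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, weights=[0.84], random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("ada", AdaBoostClassifier(random_state=1))])

# Hypothetical search space; real grids would be larger
param_grid = {"ada__n_estimators": [50, 100], "ada__learning_rate": [0.1, 1.0]}

# Grid search tries every combination...
t0 = time.time()
grid = GridSearchCV(pipe, param_grid, scoring="recall", cv=3).fit(X, y)
t_grid = time.time() - t0

# ...while randomized search samples only n_iter of them
t0 = time.time()
rand = RandomizedSearchCV(pipe, param_grid, n_iter=2, scoring="recall",
                          cv=3, random_state=1).fit(X, y)
t_rand = time.time() - t0

print(f"grid: {grid.best_score_:.3f} in {t_grid:.1f}s; "
      f"randomized: {rand.best_score_:.3f} in {t_rand:.1f}s")
```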

We can also use the make_pipeline function instead of Pipeline to create a pipeline.

make_pipeline: This is shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names are set automatically to the lowercase of their class names.
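For example:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step names are generated automatically from the lowercased class names
pipe = make_pipeline(StandardScaler(), AdaBoostClassifier(random_state=1))
print(pipe.named_steps.keys())  # dict_keys(['standardscaler', 'adaboostclassifier'])
```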

Let's create a new get_metrics_score function so that we don't have to input the same data repeatedly for each model.

7.1) XGBoost

GridSearchCV:

RandomizedSearchCV:

7.2) AdaBoost

GridSearchCV:

RandomizedSearchCV:

7.3) Gradient Boosting Classifier

GridSearchCV:

RandomizedSearchCV:

Comparing all models

8) Actionable Insights & Recommendations